Take-home Exercise 2

Author

Le Thanh Thao

1 Overview

In this take-home exercise, we aim to regionalize Nigeria using, but not limited to, the following measures:

  • Total number of functional water points

  • Total number of non-functional water points

  • Percentage of functional water points

  • Percentage of non-functional water points

  • Percentage of main water point technology (i.e. Hand Pump, Mechanized Pump)

  • Percentage of usage capacity (i.e. <1000, >=1000)

  • Percentage of rural water points

2 Data Set

2.1 Aspatial Data

The WPdx+ dataset of Nigeria in csv format was downloaded from the WPdx (Water Point Data Exchange) Global Data Repositories. The dataset contains water point related data from rural areas at the water point or small water scheme level. The dataset was renamed geo_export.csv and saved under the data/aspatial folder.

2.2 Geospatial data

Nigeria Level-2 Administrative Boundary (also known as Local Government Area) polygon features GIS data was downloaded from geoBoundaries portal. The file names were “geoBoundaries-NGA-ADM2” with different file formats and saved under data/geospatial folder.

3 Getting Started

The code chunks below install and launch these R packages in the R environment:

pacman::p_load(rgdal, spdep, tmap, sf, ggpubr, cluster, factoextra, NbClust, heatmaply, corrplot, psych, tidyverse, funModeling, ClustGeo, GGally)

Below is the explanation for packages used in this Take-home exercise:

  • Spatial data handling

    • sf, rgdal and spdep
  • Attribute data handling

    • tidyverse, especially readr, ggplot2 and dplyr
  • Choropleth mapping

    • tmap
  • Multivariate data visualisation and analysis

    • corrplot, ggpubr, factoextra, GGally and heatmaply
  • Cluster analysis

    • cluster

    • ClustGeo implements a Ward-like hierarchical clustering algorithm including spatial/geographical constraints.

    • NbClust determines the optimal number of clusters in a data set and offers the best clustering scheme from different results to the user.

4 Data Import and Preparation

4.1 Importing Aspatial data into R environment

Since the geo_export data set was downloaded in csv format, the read_csv() function is used to read and import the geo_export.csv file into the R environment.

wp_nga<-read_csv("data/aspatial/geo_export.csv")

Things to learn from the above code chunks:

  • The original file name was Water_Point_Data_Exchange_-PlusWPdx.csv. However, it has been renamed to geo_export.csv for easy encoding.

  • read_csv() of readr package was used instead of read.csv() function.
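The choice matters here because many WPdx+ column names begin with #. A minimal sketch (using a hypothetical two-column csv written to a temporary file) of how the two readers treat such names:

```r
library(readr)
# Hypothetical miniature csv with a WPdx-style column name
tmp <- tempfile(fileext = ".csv")
writeLines(c("#status_clean,count", "Functional,3"), tmp)
names(read_csv(tmp, show_col_types = FALSE))  # "#status_clean" is preserved
names(read.csv(tmp))                          # name converted to a syntactically valid one
```

read_csv() also returns a tibble rather than a base data frame, which prints more compactly.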

4.2 Converting data from tibble data frame with Well Known Text (wkt) format into sf data frame

After importing the data into R environment, we examine if the data has been imported correctly by using the list() function of base R.

list(wp_nga)

From the above summary, we can see that the wp_nga object is in tibble data frame format with 95008 rows and 70 columns.

Notice that this newly created data frame has a column called New Georeferenced Column which represents spatial data in text format. This is known as Well Known Text or wkt in short.
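For illustration, Well Known Text is just a plain-text encoding of geometry, and st_as_sfc() of sf package can parse such strings directly (the coordinates below are hypothetical):

```r
library(sf)
# A hypothetical WKT point string, similar in form to the values stored
# in the New Georeferenced Column
st_as_sfc("POINT (7.9833 9.0667)")
```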

Next, we will need to convert this tibble data frame in wkt format into an sf data frame using the sf package. We will start by deriving the Geometry field.

The st_as_sfc() function of sf package derives a new field called Geometry, as shown in the below code chunk:

wp_nga$Geometry<-st_as_sfc(wp_nga$`New Georeferenced Column`)

After running the above code chunks, when we open the wp_nga data frame under Environment and scroll to the right, we notice that there is a new field called Geometry added onto the wp_nga data frame.

Next, the st_sf() function of sf package will be used to convert the tibble data frame into an sf data frame, as shown in the below code chunk:

wp_sf<-st_sf(wp_nga,crs=4326)

We can check the summary of wp_sf data frame using the below code chunks with list() function:

list(wp_sf)

Note that wp_sf is now a simple feature data frame with 95008 rows and 70 columns. The coordinate system used is the WGS 84 Geographic Coordinate System, which means the geometry data will be in decimal degrees.
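We can also confirm the coordinate reference system directly with the st_crs() function of sf package:

```r
st_crs(wp_sf)  # should report EPSG:4326, i.e. WGS 84
```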

4.3 Importing Geospatial Data

In the below code chunk, we use the st_read() function of sf package to read the geospatial data file. We also specifically keep just shapeName together with the geometry information, and call this new object “nga”.

nga<-st_read(dsn="data/geospatial",layer="geoBoundaries-NGA-ADM2",crs=4326)%>%
  select(shapeName)

Note that the nga object is an sf data frame with 774 features.

After importing the nga data frame object, it is good to take a look at the data.

We can notice that some shapeName values are duplicated even though their coordinates are different.

Therefore, it is crucial for us to rename these locations to their correct names.

The below code chunks show the steps to rename the duplicated locations.

First, we use order() function to arrange the shapeName in ascending order.

nga<-(nga[order(nga$shapeName),])

Next, we find the list of duplicated areas by using the code chunks below. The duplicated() function of base R determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates.

duplicate_area<-nga$shapeName[nga$shapeName %in% nga$shapeName[duplicated(nga$shapeName)]]
duplicate_area

We then use the tmap package to plot the location of each area, and Google Maps to retrieve the actual name of each location.

We can now access the individual row indices of the nga data frame and change shapeName accordingly. Last, we use the length() function to ensure there are no more duplicates.

nga$shapeName[c(94,95,304,305,355,356,519,546,547,693,694)] <- c("Bassa (Kogi)","Bassa (Plateau)",
                                                                               "Ifelodun (Kwara)","Ifelodun (Osun)",
                                                                               "Irepodun (Kwara)","Irepodun (Osun)",
                                                                               "Nassarawa","Obi (Benue)","Obi(Nasarawa)",
                                                                               "Surulere (Lagos)","Surulere (Oyo)")

length((nga$shapeName[ nga$shapeName %in% nga$shapeName[duplicated(nga$shapeName)] ]))

Method Reference: Jordan, O. (2022) Geospatial Analytics for Social Good-Understanding Nigeria Water functional and non-functional water point rate. Retrieved from https://jordan-isss624-geospatial.netlify.app/posts/geo/geospatial_exercise/#data-wrangling

After changing the shapeName values, we can go to the Environment window on the right and double-click on the nga data frame to see the changes. We will notice that we no longer have 2 LGAs called Bassa. Instead, they have been renamed to Bassa (Kogi) and Bassa (Plateau).

5 Geoprocessing with dplyr and sf

5.1 Ensuring location data accuracy

In the below code chunks, we are going to use a geoprocessing function (commonly known as a GIS analysis operation) called point-in-polygon overlay to transfer the information from the nga data frame into the wp_sf data frame.

wp_sf<-st_join(wp_sf,nga)

We generate another data frame just to view the contents of two columns: #clean_adm2 and shapeName.

wp_sf1<-select(wp_sf,c(`#clean_adm2`,`shapeName`))

In principle, #clean_adm2 should provide the LGA name where each water point is located. However, we notice that there are some discrepancies between #clean_adm2 and shapeName after running the above code chunks. For example, 6 water points declared under Bakassi, the Cameroon-Nigeria territorial dispute area, were in fact located in Akpabuyo, which is part of Nigeria. We can compare the coordinates of Bakassi in the nga data frame against the coordinates of the self-declared Bakassi water points in wp_sf.

st_join() of sf package has helped us match each water point in wp_sf to the actual area name based on the polygon it falls in under the nga data frame. This ensures location data accuracy, instead of putting trust in the declared #clean_adm2 values, which may not be accurate.

Thanks to this, in a later part when we need to perform contiguity-based methods to find the list of neighbors for a particular areal unit, we no longer face the issue of zero neighbors. More details will be mentioned later.
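As a hedged preview (the consolidated LGA-level object nga_wp is only created in the next section), a contiguity-based neighbor list could be built with the poly2nb() function of spdep package, and any region with zero links would be flagged in its summary:

```r
library(spdep)
# Sketch only: assumes nga_wp is the cleaned LGA-level sf polygon data frame
wm_q <- poly2nb(nga_wp, queen = TRUE)  # Queen contiguity
summary(wm_q)                          # reports regions with no links, if any
```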

5.2 Deriving new variables using dplyr package

In this section, in order to perform our in-depth analysis at the LGA level, we compute and consolidate new attributes for each LGA, such as the total number of water points, total number of non-functional water points, total number of functional water points, percentage of non-functional water points and many more.

5.2.1 Data Cleaning

First, it is important for us to do some cleaning on the data source to handle NA and missing values.

We notice that there are missing values under the #status_clean column. Therefore, we will replace all these missing values with “Unknown” by using the mutate() function of dplyr package together with replace_na() of tidyr package.

wp_sf<-wp_sf%>%
  mutate(`#status_clean`=replace_na(`#status_clean`,"Unknown"))

We can use the freq() function of funModeling package to generate a frequency table to observe the data inputs under #status_clean and their frequency in terms of percentage.

freq(data=wp_sf,input="#status_clean")

We use the below code chunks to observe the data inputs under #water_tech_category, is_urban and usage_capacity.

freq(data=wp_sf,input="#water_tech_category")

From the above frequency table, we notice that there are different types of technology used for water points, such as hand pump, mechanized pump, tapstand and rope and bucket.

freq(data=wp_sf,input="is_urban")

From the above frequency table, we notice the inputs include “TRUE” and “FALSE” values. A large majority (79.41%) of the inputs were “FALSE”, which means that most of the water points studied were in rural areas.

freq(data=wp_sf,input="usage_capacity")

From the above frequency table, we notice inputs under column usage_capacity include 50, 250, 300 and 1000.

5.2.2 Creating a data frame for functional water points

From the frequency table above, we can see there are a few different inputs which effectively mean “Functional”: “Functional”, “Functional but needs repair” and “Functional but not in use”. We will then create a new data frame called wpt_functional to capture all the functional water points across Nigeria by using the filter() function of dplyr package on the wp_sf data frame, as shown in the below code chunks.

wpt_functional<-wp_sf%>%
  filter(`#status_clean` %in% c("Functional","Functional but needs repair","Functional but not in use"))

5.2.3 Creating a data frame for non-functional water points

Similarly, for non-functional water points, we will also create a new data frame called wpt_nonfunctional to capture all the non-functional water points by using the filter() function on the wp_sf data frame, as shown in the below code chunks.

wpt_nonfunctional<-wp_sf%>%
  filter(`#status_clean` %in% c("Non-Functional","Non-Functional due to dry season","Abandoned/Decommissioned","Abandoned","Non functional due to dry season"))

5.2.4 Creating a data frame for unknown water points

The below code chunk creates a new data frame called wpt_unknown by filtering “Unknown” water points from wp_sf data frame.

wpt_unknown<-wp_sf%>%
  filter(`#status_clean`=="Unknown")

5.2.5 Creating a data frame for water points with usage capacity of less than 1000

The below code chunk creates a new data frame called uc_lt1000 by filtering water points with usage capacity less than 1000 from wp_sf data frame.

uc_lt1000<-wp_sf%>%
  filter(`usage_capacity`<1000)

5.2.6 Creating a data frame for water points with usage capacity of equal or more than 1000

The below code chunk creates a new data frame called uc_mt1000 by filtering water points with usage capacity equal or more than 1000 from wp_sf data frame.

uc_mt1000<-wp_sf%>%
  filter(`usage_capacity`>=1000)

5.2.7 Creating a data frame for all urban water points

The below code chunks filter the is_urban field from the wp_sf data frame to TRUE and create a new data frame featuring all urban water points:

urban_wpt<-wp_sf%>%
  filter(`is_urban`=="TRUE")

5.2.8 Creating a data frame for all rural water points

The below code chunks filter the is_urban field from the wp_sf data frame to FALSE and create a new data frame featuring all rural water points:

rural_wpt<-wp_sf%>%
  filter(`is_urban`=="FALSE")

5.2.9 Creating a data frame for water points with Hand pump technology

The below code chunks filter the #water_tech_category field from the wp_sf data frame to “Hand Pump” and create a new data frame featuring all water points with Hand Pump technology:

handpump_wpt<-wp_sf%>%
  filter(`#water_tech_category`=="Hand Pump")

5.2.10 Creating a data frame for water points with Mechanized Pump technology

The below code chunks filter the `#water_tech_category` field from the wp_sf data frame to “Mechanized Pump” and create a new data frame featuring all water points with Mechanized Pump technology:

mechanized_wpt<-wp_sf%>%
  filter(`#water_tech_category`=="Mechanized Pump")

5.2.11 Creating a data frame for water points with Tapstand

The below code chunks filter the `#water_tech_category` field from the wp_sf data frame to “Tapstand” and create a new data frame featuring all water points with Tapstand technology:

tapstand_wpt<-wp_sf%>%
  filter(`#water_tech_category`=="Tapstand")

5.2.12 Creating a data frame for water points with Rope and Bucket

The below code chunks filter the `#water_tech_category` field from the wp_sf data frame to “Rope and Bucket” and create a new data frame featuring all water points with Rope and Bucket technology:

randb_wpt<-wp_sf%>%
  filter(`#water_tech_category`=="Rope and Bucket")

We will then use the below code chunks to create nga_wp by updating nga with new fields. We will start with new fields reflecting the number of water points in different categories.

nga_wp<-nga %>%
  mutate(`total wpt`=lengths(st_intersects(nga,wp_sf)))%>%
  mutate(`wpt functional`=lengths(st_intersects(nga,wpt_functional)))%>%
  mutate(`wpt nonfunctional`=lengths(st_intersects(nga,wpt_nonfunctional)))%>%
  mutate(`wpt unknown`=lengths(st_intersects(nga,wpt_unknown)))%>%
  mutate(`wpt with capacity <1000`=lengths(st_intersects(nga,uc_lt1000)))%>%
  mutate(`wpt with capacity >=1000`=lengths(st_intersects(nga,uc_mt1000)))%>%
  mutate(`wpt urban`=lengths(st_intersects(nga,urban_wpt)))%>%
  mutate(`wpt rural`=lengths(st_intersects(nga,rural_wpt)))%>%
  mutate(`wpt hand pump`=lengths(st_intersects(nga,handpump_wpt)))%>%
  mutate(`wpt mechanized pump`=lengths(st_intersects(nga,mechanized_wpt)))%>%
  mutate(`wpt tapstand`=lengths(st_intersects(nga,tapstand_wpt)))%>%
  mutate(`wpt rope and bucket`=lengths(st_intersects(nga,randb_wpt)))

The above code chunks perform the following operations:

  • First, the st_intersects() function of sf package identifies, for each areal unit, the water points belonging to each category created in the previous part (total, functional, non-functional, unknown and so on).

  • Next, the lengths() function of base R counts the number of water points of each category that fall into each areal unit.

  • Last, the mutate() function creates new columns for the calculated values and names them total wpt, wpt functional, wpt nonfunctional, wpt unknown, wpt urban, wpt rural, wpt with capacity <1000, wpt with capacity >=1000, wpt hand pump, wpt mechanized pump and so on.
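To make the mechanics concrete, here is a minimal toy sketch (two unit squares and three points, not the Nigeria layers) showing how lengths(st_intersects(...)) counts points per polygon:

```r
library(sf)
# Two adjacent unit-square polygons
polys <- st_sfc(
  st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
  st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
)
# Three points: two fall in the first square, one in the second
pts <- st_sfc(st_point(c(0.5, 0.5)), st_point(c(0.6, 0.4)), st_point(c(1.5, 0.5)))
lengths(st_intersects(polys, pts))  # 2 1
```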

The below code chunks continue updating nga_wp with new fields on the percentage of water points in different categories.

nga_wp<-nga_wp%>%
  mutate(`pct functional`=`wpt functional`/`total wpt`)%>%
  mutate(`pct nonfunctional`=`wpt nonfunctional`/`total wpt`)%>%
  mutate(`pct unknown`=`wpt unknown`/`total wpt`)%>%
  mutate(`pct capacity<1000`=`wpt with capacity <1000`/`total wpt`)%>%
  mutate(`pct capacity>=1000`=`wpt with capacity >=1000`/`total wpt`)%>%
  mutate(`pct urban`=`wpt urban`/`total wpt`)%>%
  mutate(`pct rural`=`wpt rural`/`total wpt`)%>%
  mutate(`pct hand pump`=`wpt hand pump`/`total wpt`)%>%
  mutate(`pct mechanized pump`=`wpt mechanized pump`/`total wpt`)%>%
  mutate(`pct tapstand`=`wpt tapstand`/`total wpt`)%>%
  mutate(`pct rope and bucket`=`wpt rope and bucket`/`total wpt`)

5.3 Filtering out regions without any water points

In regions without any water points, the percentage of each attribute relative to the total number of water points becomes NaN due to division by 0. These regions are of little value to us, as the objective of this study is to focus on areas with water points.

The below code chunks keep only regions WITH water points by using the filter() function to require the total number of water points to be different from 0.

nga_wp<-nga_wp%>%
  filter(`total wpt`!=0)

After running the above code, we will notice that the number of observations has dropped from 774 to 761. We have filtered out 13 LGAs that had no water points, including the Bakassi region mentioned previously. With this, we will no longer have any issue with regions without neighbors later on, as Bakassi is considered part of Cameroon’s territory. We also need to take note that our choropleth maps in subsequent sections will have some missing polygons for those LGAs that did not have water points.

Next, we can safely save this output data frame in rds format for future processing, instead of having to rerun the huge geo_export.csv import and the previous code chunks, by using the write_rds() function of readr package.

write_rds(nga_wp,"data/aspatial/nga_wp.rds")

The newly created nga_wp.rds is only 2.1MB, which makes it easier for us to push to GitHub later.

6 Exploratory Data Analysis (EDA)

In the below code chunks, we read and import the nga_wp.rds file into R environment by using read_rds() function of readr package.

nga_wp<-read_rds("data/aspatial/nga_wp.rds")

6.1 EDA using statistics graphics

First, we can explore the distribution of an individual variable (in this example, the percentage of non-functional water points) by using the ggplot() function combined with geom_histogram() of ggplot2 package, as in the below code chunks:

ggplot(data=nga_wp,aes(x=`pct nonfunctional`))+
  geom_histogram(bins=20,color="black",fill="light blue")

We can also use a box plot to show the distribution of the percentage of non-functional water points, using the below code chunks. They are quite similar to the previous ones, except that geom_histogram() is changed to geom_boxplot().

ggplot(data=nga_wp,aes(x=`pct nonfunctional`))+
  geom_boxplot(color="black",fill="light blue")

From the above histogram and box plot, we notice that the distribution of the percentage of non-functional water points is right-skewed. There was an outlier at 1, which means that for that LGA, 100% of its water points were non-functional. This is quite a worrying sight.

Instead of plotting a separate graph for each variable, we can plot multiple individual histograms and then group them. In the below code chunks, we plot the individual histograms first.

pct_functional<-ggplot(data=nga_wp,aes(x=`pct functional`))+
  geom_histogram(bins=20,color="black",fill="light blue")

pct_nonfunctional<-ggplot(data=nga_wp,aes(x=`pct nonfunctional`))+
  geom_histogram(bins=20,color="black",fill="light blue")

pct_caplt1000<-ggplot(data=nga_wp,aes(x=`pct capacity<1000`))+
  geom_histogram(bins=20,color="black",fill="light blue")

pct_capmt1000<-ggplot(data=nga_wp,aes(x=`pct capacity>=1000`))+
  geom_histogram(bins=20,color="black",fill="light blue")

pct_rural<-ggplot(data=nga_wp,aes(x=`pct rural`))+
  geom_histogram(bins=20,color="black",fill="light blue")

pct_urban<-ggplot(data=nga_wp,aes(x=`pct urban`))+
  geom_histogram(bins=20,color="black",fill="light blue")

pct_handpump<-ggplot(data=nga_wp,aes(x=`pct hand pump`))+
  geom_histogram(bins=20,color="black",fill="light blue")

pct_mechanized<-ggplot(data=nga_wp,aes(x=`pct mechanized pump`))+
  geom_histogram(bins=20,color="black",fill="light blue")

We then use the ggarrange() function of ggpubr package to group these histograms together. In this example, we split the histogram arrangement into 2 columns with 4 rows by setting the “ncol” argument to 2 and “nrow” to 4.

ggarrange(pct_functional,pct_nonfunctional,pct_caplt1000,pct_capmt1000,pct_rural,pct_urban,pct_handpump,pct_mechanized,ncol=2,nrow=4)

There are a few observations that we can notice from the above histograms:

  • The distribution of the percentage of functional water points resembles a normal distribution. However, a statistical test needs to be conducted to confirm normality.

  • The distribution of the percentage of non-functional water points is right-skewed.

  • There were more LGAs with a low percentage of water points with capacity of 1000 or more than LGAs with a high percentage of such water points.

  • The number of LGAs with more than 50% of water points using mechanized pump technology was lower than the number of LGAs with less than 50%.

  • Most of the water points in Nigeria were located in rural areas. The distribution of the percentage of rural water points is extremely left-skewed, with most LGAs having more than 75% of their water points located in rural areas.
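Following up on the first observation, a Shapiro-Wilk test from the base stats package is one way to formally check the normality of the percentage of functional water points; this is a sketch rather than part of the original analysis:

```r
# Null hypothesis: the sample comes from a normal distribution;
# a p-value above 0.05 would mean we fail to reject normality
shapiro.test(nga_wp$`pct functional`)
```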

6.2 Preparing choropleth maps

We can explore the distribution of different variables such as number of non-functional water points, percentage of non-functional water points in Nigeria at LGA level, and other attributes by plotting choropleth maps.

The below code chunks are used to prepare choropleth maps by using the qtm() function of tmap package. We can plot choropleth maps of the distribution of the percentage of non-functional water points and the number of non-functional water points.

qtm(nga_wp,"wpt nonfunctional")

qtm(nga_wp,"pct nonfunctional")

In order to reveal how the distribution of the number of non-functional water points is biased by the underlying total number of water points, we can plot these two choropleth maps side by side. We can also add in the distribution of the percentage of non-functional water points for comparison.

total_wpt.map <- tm_shape(nga_wp) + 
  tm_fill(col = "total wpt",
          n = 5,
          style = "jenks", 
          title = "Total water points") + 
  tm_borders(alpha = 0.5) 

wpt_nonfunctional.map <- tm_shape(nga_wp) + 
  tm_fill(col = "wpt nonfunctional",
          n = 5,
          style = "jenks",
          title = "Number of Nonfunctional water points ") + 
  tm_borders(alpha = 0.5) 

pct_nonfunctional.map <- tm_shape(nga_wp) + 
  tm_fill(col = "pct nonfunctional",
          n = 5,
          style = "jenks",
          title = "Percentage of Nonfunctional water points ") + 
  tm_borders(alpha = 0.5) +
  tm_view(set.zoom.limits = c(11,14))

tmap_arrange(total_wpt.map, wpt_nonfunctional.map, pct_nonfunctional.map,
             asp=NA, ncol=2,sync=TRUE)

Notice how LGAs with a smaller total number of water points also show a smaller number of non-functional water points. Therefore, in our further analysis, we will drop count-based attributes such as the total number of non-functional water points, total number of functional water points, total number of water points with hand pumps, total number of rural water points, etc., as they are biased towards the underlying total number of water points and hence not meaningful for our analysis.

We also notice that for many regions (such as LGAs in south-west Nigeria), even though the number of non-functional water points was relatively low, the percentage of non-functional water points could be very high. Water resources were scarce in these regions, and at the same time many water points were not functional.

In the below code chunks, we will plot choropleth maps showing the distribution of the total number of water points, the percentage of non-functional water points, the percentage of rural water points and the percentage of water points with hand pumps. The reason why the latter three variables were chosen will be revealed later.

tm_shape(nga_wp)+
  tm_polygons(c("total wpt","pct nonfunctional","pct rural","pct hand pump"),style="jenks")+
  tm_facets(sync=TRUE,ncol=2)+
  tm_legend(legend.position=c("right","bottom"))+
  tm_layout(outer.margins = 0,asp=0)

Overall, the majority of Nigerian LGAs have a very high percentage of rural water points. In the south-west and south-south (Niger Delta) regions of Nigeria, we notice a worrying trend: although these LGAs had fewer water points, their percentage of non-functional water points tended to be higher.

7 Correlation Analysis

In the below code chunks, we use the corrplot.mixed() function of corrplot package to visualize and analyze the correlations among the input variables.

nga_wp1<-nga_wp%>%
  st_set_geometry(NULL)
cluster_vars.cor<-cor(nga_wp1[,14:24])
corrplot.mixed(cluster_vars.cor,
               lower = "ellipse", upper = "number",
               tl.pos = "lt", diag = "l", tl.col = "black",
               number.cex = 1.3, tl.cex = 1.3, cl.cex = 1.3)

In order to have a neater and tidier correlation plot matrix, we can add arguments such as “number.cex” to adjust the font size of the correlation coefficients, “tl.cex” to adjust the font size of the text labels (i.e. variable names) and “cl.cex” to adjust the font size of the colour legend.

From the above correlation plot matrix, we can notice perfect negative correlations between: the percentage of rural water points and the percentage of urban water points (which fits our understanding that an area can only be categorized as either urban or rural, but not both); the percentage of water points with usage capacity <1000 and those with usage capacity >=1000; and the percentage of water points with usage capacity <1000 and the percentage of mechanized pumps. In other words, modern mechanized pumps have a higher usage capacity of at least 1000, while traditional hand pumps can only provide a usage capacity of up to 300 in the case of Nigeria.

At the same time, we also observe a strong correlation between the percentages of functional and non-functional water points. This also fits our understanding that a water point can only be classified as functional, non-functional or unknown.

Therefore, we can drop attributes such as pct urban (percentage of water points in urban areas), pct mechanized pump (percentage of water points with mechanized pumps), pct capacity<1000 (percentage of water points with usage capacity of less than 1000), pct capacity>=1000 (percentage of water points with usage capacity of at least 1000), as well as pct functional (percentage of functional water points). The attributes we keep for our cluster analysis are in line with the objective of this study: to address problems related to non-functional water points, water points with outdated technology, and water points in rural areas where the population may have difficulty accessing a quality source of water.

8 Hierarchy Cluster Analysis

8.1 Extracting clustering variables

The code chunks below are used to extract the clustering variables from the nga_wp simple feature object into a data frame.

cluster_vars<-nga_wp%>%
  st_set_geometry(NULL)%>%
  select("shapeName","pct nonfunctional","pct rural","pct hand pump")
head(cluster_vars,10)
        shapeName pct nonfunctional  pct rural pct hand pump
1       Aba North         0.5294118 0.00000000    0.11764706
2       Aba South         0.4929577 0.05633803    0.09859155
3           Abaji         0.5964912 0.84210526    0.40350877
4            Abak         0.5208333 0.83333333    0.08333333
5       Abakaliki         0.1802575 0.87553648    0.43776824
6  Abeokuta North         0.4411765 0.20588235    0.14705882
7  Abeokuta South         0.2773109 0.00000000    0.16806723
8             Abi         0.4078947 0.95394737    0.59868421
9     Aboh-Mbaise         0.3939394 0.72727273    0.01515152
10     Abua/Odual         0.3333333 0.53846154    0.30769231

Notice that the final list of clustering variables excludes those variables removed in the previous section: pct functional, pct urban, pct mechanized pump, pct capacity<1000 and pct capacity>=1000. We will only keep the below clustering variables:

  • shapeName

  • pct nonfunctional

  • pct rural

  • pct hand pump

Next, we need to set the row names to the shape name (shapeName) instead of the row number, to ensure that each individual row is an individual object with an individual name.

row.names(cluster_vars)<-cluster_vars$shapeName
head(cluster_vars,10)
                    shapeName pct nonfunctional  pct rural pct hand pump
Aba North           Aba North         0.5294118 0.00000000    0.11764706
Aba South           Aba South         0.4929577 0.05633803    0.09859155
Abaji                   Abaji         0.5964912 0.84210526    0.40350877
Abak                     Abak         0.5208333 0.83333333    0.08333333
Abakaliki           Abakaliki         0.1802575 0.87553648    0.43776824
Abeokuta North Abeokuta North         0.4411765 0.20588235    0.14705882
Abeokuta South Abeokuta South         0.2773109 0.00000000    0.16806723
Abi                       Abi         0.4078947 0.95394737    0.59868421
Aboh-Mbaise       Aboh-Mbaise         0.3939394 0.72727273    0.01515152
Abua/Odual         Abua/Odual         0.3333333 0.53846154    0.30769231

Now the row names have been changed to the shape names.

However, we still need to delete the shapeName field, using the code chunks below:

nga_att<- select(cluster_vars, c(2:4))
head(nga_att, 10)
               pct nonfunctional  pct rural pct hand pump
Aba North              0.5294118 0.00000000    0.11764706
Aba South              0.4929577 0.05633803    0.09859155
Abaji                  0.5964912 0.84210526    0.40350877
Abak                   0.5208333 0.83333333    0.08333333
Abakaliki              0.1802575 0.87553648    0.43776824
Abeokuta North         0.4411765 0.20588235    0.14705882
Abeokuta South         0.2773109 0.00000000    0.16806723
Abi                    0.4078947 0.95394737    0.59868421
Aboh-Mbaise            0.3939394 0.72727273    0.01515152
Abua/Odual             0.3333333 0.53846154    0.30769231

8.2 Computing proximity matrix

The code chunk below is used to compute the proximity matrix using the Euclidean method, specified in the “method” argument.

proxmat<-dist(nga_att,method="euclidean")

We can then use the below code chunk to list the contents of proxmat for visual inspection. However, do note that the matrix is extremely long, as we have 761 observations.

proxmat
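Rather than printing all pairwise distances, a small corner of the matrix can be inspected instead, for example:

```r
# Coerce the dist object to a full matrix and view the first 5 rows and columns
as.matrix(proxmat)[1:5, 1:5]
```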

8.3 Computing hierarchical clustering

The below code chunk uses the hclust() function of stats package to perform hierarchical cluster analysis using the ward.D method and the proximity matrix. Note that the result is stored in an object of class hclust, which describes the tree produced by the clustering process.

hclust_ward<-hclust(proxmat, method="ward.D")

We will then use plot() function of R graphics to plot the tree as shown in the below code chunk.

plot(hclust_ward,cex=0.1,hang=-1)

Ward’s minimum variance method is one of the most commonly used. It minimizes the total within-cluster variance: at each step, the pair of clusters with the minimum between-cluster distance is merged.
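As an illustrative sketch, borders can be drawn around a chosen number of clusters on the dendrogram using rect.hclust() of stats package (the value k = 4 here is purely for illustration; the optimal number of clusters is determined formally in the next sections):

```r
plot(hclust_ward, cex = 0.1, hang = -1)
rect.hclust(hclust_ward, k = 4, border = 2:5)  # highlight 4 candidate clusters
```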

8.4 Selecting the optimal clustering algorithm

In this section, we will use the agnes() function of cluster package to identify a stronger clustering structure. agnes() works like hclust(); however, it can also compute the agglomerative coefficient, which measures the amount of clustering structure found (values closer to 1 suggest a strong clustering structure).

The below code chunks are used to compute the agglomerative coefficients of all hierarchical clustering algorithms.

m<-c("average","single","complete","ward")
names(m)<-c("average","single","complete","ward")
ac<-function(x){agnes(nga_att,method=x)$ac}
map_dbl(m,ac)
  average    single  complete      ward 
0.9371130 0.8674494 0.9670278 0.9950800 

In the above code chunks, we notice there are a few operations:

  • names() function is used to set the names of the vector elements; in this case, the four method names are used to name the elements.

  • function() is used to declare a function. Combined with map_dbl(), it works like a loop: the agnes() function computes the agglomerative coefficient of each of the four methods in turn.

  • map_dbl() function of purrr package is used to return a double vector; in this case, the four methods and their respective agglomerative coefficients.

From the above output, we can see that Ward’s method provides the strongest clustering structure among the four methods assessed, with an agglomerative coefficient of 0.99508, the closest to 1. Therefore, only Ward’s method will be used in the subsequent analysis.

8.5 Determining Optimal Clusters

In this section, we will use the Gap Statistic method covered in class to determine the optimal number of clusters.

8.5.1 Gap statistics method

The below code chunk uses the clusGap() function of cluster package to compute the gap statistic. Note that the “firstmax” method gives the location of the first local maximum.

set.seed(12345)
gap_stat<-clusGap(nga_att,FUN=hcut,nstart=25,K.max=10,B=50)
print(gap_stat,method="firstmax")
Clustering Gap statistic ["clusGap"] from call:
clusGap(x = nga_att, FUNcluster = hcut, K.max = 10, B = 50, nstart = 25)
B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
 --> Number of clusters (method 'firstmax'): 4
          logW   E.logW       gap     SE.sim
 [1,] 4.793881 5.181868 0.3879874 0.01022743
 [2,] 4.548543 5.003795 0.4552520 0.01753952
 [3,] 4.382802 4.890738 0.5079360 0.01604334
 [4,] 4.262997 4.792929 0.5299313 0.01966354
 [5,] 4.205561 4.717804 0.5122433 0.01952836
 [6,] 4.143917 4.651981 0.5080638 0.01875618
 [7,] 4.082583 4.592873 0.5102901 0.01822159
 [8,] 4.025371 4.540953 0.5155820 0.01655913
 [9,] 3.987533 4.495500 0.5079672 0.01536691
[10,] 3.957832 4.454777 0.4969448 0.01483307

Next, we can visualize the gap statistic by using fviz_gap_stat() of factoextra package.

fviz_gap_stat(gap_stat)

With reference to the gap statistic graph above, the recommended number of clusters is 4, as it has the largest gap statistic of 0.5299313. This means that with 4 clusters, the clustering structure is furthest away from a random uniform distribution of points.
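The gap statistic can be cross-checked against other criteria. As a sketch, fviz_nbclust() of factoextra accepts the same hcut clustering function and can plot, for instance, the total within-sum-of-squares (elbow) curve:

```r
# Elbow plot: look for the "bend" in total within-cluster sum of squares
fviz_nbclust(nga_att, hcut, method = "wss", k.max = 10)
```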

8.6 Re-plotting the dendrograms with borders around the selected clusters

In the below code chunks, we use rect.hclust() function of R stats to draw a dendrogram with borders around the selected clusters. The argument border is used to specify the border colors for the rectangles.

plot(hclust_ward,cex=0.1)
rect.hclust(hclust_ward,k=4,border=2:5)

8.7 Visually-driven hierarchical clustering analysis

8.7.1 Transforming the data frame into a matrix

The data was loaded into a data frame but it has to be converted into a matrix in order to make a heatmap.

The below code chunks use data.matrix() function of base R to transform nga_att data frame into a data matrix.

nga_att_mat<-data.matrix(nga_att)

8.7.2 Plotting interactive cluster heatmap using heatmaply()

In the code chunks below, heatmaply() function of heatmaply package is used to build an interactive cluster heatmap.

heatmaply(normalize(nga_att_mat),
          Colv = NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 4,
          margins = c(NA, 200, 60, NA),
          fontsize_row = 2,
          fontsize_col = 8,
          main = "Geographic Segmentation of Nigeria by Water points breakdown",
          xlab = "Water points",
          ylab = "Shape Name")

From the above graph, we can see some characteristics of the individual clusters:

  • The purple cluster had a very high percentage of rural water points and a high percentage of water points with hand pumps.

  • The gold/yellow cluster, on the other hand, had a very high percentage of rural water points while its percentage of water points with hand pumps was very low.

  • The teal/green cluster had a relatively high percentage of water points with hand pumps and a moderately high percentage of non-functional water points.

  • The pink cluster had relatively low percentages across all three attributes: percentage of rural water points, percentage of non-functional water points and percentage of water points with hand pumps.

8.8 Mapping the clusters formed

After close examination of the dendrogram above, we can stick with 4 clusters.

The below code chunk uses the cutree() function of R stats to derive a 4-cluster model and the as.factor() function to convert the result into a factor.

groups<-as.factor(cutree(hclust_ward,k=4))

The output is called groups. It is a factor object whose levels are the cluster labels.
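Before mapping, a quick way to see how many LGAs fall into each cluster is to tabulate the result (a minimal sketch):

```r
# Count the LGAs assigned to each of the four clusters
table(groups)
```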

In order to visualize the clusters, the groups object needs to be appended onto nga_wp simple feature object.

nga_wp_cluster<-cbind(nga_wp,as.matrix(groups))%>%
  rename(`CLUSTER`=`as.matrix.groups.`)

The code chunks above perform 3 operations to form the join:

  • as.matrix() function converts the groups factor into a one-column matrix; and

  • cbind() function of base R appends groups matrix after conversion onto the nga_wp object by columns to produce a simple feature object called nga_wp_cluster; and

  • rename() function of dplyr package renames as.matrix.groups. field to CLUSTER.

Next, we can use qtm() function of tmap package to plot the choropleth map of the non-spatially constrained hierarchical clusters formed.

qtm(nga_wp_cluster,"CLUSTER")

However, the choropleth map above reveals that the clusters are very fragmented. This is one of the major limitations of non-spatial clustering algorithms.

9 Spatially Constrained Clustering: SKATER approach

9.1 Converting SimpleFeature(sf) data frame into SpatialPolygonsDataFrame

Note that if we would like to use SKATER approach to find spatially constrained clusters, it is important to convert the existing sf data frame into SpatialPolygonsDataFrame as skater() function of spdep package only supports sp objects such as SpatialPolygonsDataFrame.

The code chunk below uses the as_Spatial() function of sf package to convert nga_wp into a SpatialPolygonsDataFrame called nga_wp_sp.

nga_wp_sp<-as_Spatial(nga_wp)

9.2 Computing Neighbor List

In the below code chunks, we will use poly2nb() function of spdep package to derive the neighbor list from polygon list.

nga.nb<-poly2nb(nga_wp_sp)
summary(nga.nb)
Neighbour list object:
Number of regions: 761 
Number of nonzero links: 4348 
Percentage nonzero weights: 0.750793 
Average number of links: 5.713535 
Link number distribution:

  1   2   3   4   5   6   7   8   9  10  11  12  14 
  4  16  57 122 177 137 121  71  39  11   4   1   1 
4 least connected regions:
136 497 513 547 with 1 link
1 most connected region:
496 with 14 links

From the above summary, notice that the average number of links per LGA is 5.713535. The four least connected regions have only 1 neighbor each, while the most connected region, with ID 496, has 14 neighbors.
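These figures can also be extracted programmatically. As a sketch, the card() function of spdep returns the neighbour count of each region; the results should agree with the summary above (regions 136, 497, 513 and 547 with one link, and region 496 with 14 links):

```r
# card() returns the number of neighbours of each region
which(card(nga.nb) == 1)  # least connected regions
which.max(card(nga.nb))   # most connected region
```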

We can plot the neighbor list on nga_wp_sp by using the code chunks below. We plot the LGA boundaries first, followed by the neighbor list object, using coordinates() on the original SpatialPolygonsDataFrame to extract the polygon centroids. These centroids are used as the nodes of the graph representation. We also set the link color to blue and specify add=TRUE to plot the network on top of the boundaries.

plot(nga_wp_sp,border=grey(.5))
plot(nga.nb,coordinates(nga_wp_sp),col="blue",add=TRUE)

Note that we should map the boundaries first, followed by the network, as some areas may be clipped if the plotting order is swapped.

9.3 Computing minimum spanning tree

9.3.1 Calculating edge costs

In the below code chunk, the nbcosts() function of spdep package is used to compute the cost of each edge, which is in fact the distance between its nodes. The function computes this distance from the data frame of observations attached to each node.

lcosts<-nbcosts(nga.nb,nga_att)

Next, we will incorporate these costs into a list-weights object (the same way we did in the calculation of inverse distance weights) by specifying the newly derived lcosts as the weights. To do this, we will use the nb2listw() function of spdep package, shown in the below code chunk. Note how we specify the style argument as “B” (binary) to make sure the cost values are not row-standardized.

nga.w<-nb2listw(nga.nb,lcosts,style="B")
summary(nga.w)
Characteristics of weights list object:
Neighbour list object:
Number of regions: 761 
Number of nonzero links: 4348 
Percentage nonzero weights: 0.750793 
Average number of links: 5.713535 
Link number distribution:

  1   2   3   4   5   6   7   8   9  10  11  12  14 
  4  16  57 122 177 137 121  71  39  11   4   1   1 
4 least connected regions:
136 497 513 547 with 1 link
1 most connected region:
496 with 14 links

Weights style: B 
Weights constants summary:
    n     nn       S0       S1       S2
B 761 579121 1649.386 1773.197 17984.41

9.3.2 Computing minimum spanning tree

We find the minimum spanning tree by using mstree() function of spdep package as shown in the below code chunk:

nga.mst<-mstree(nga.w)

The value of nga.mst is a matrix with n-1 rows and three columns: the two nodes of each edge and the corresponding edge cost.

After computing the MST, we can use the class() function of base R to check the class and dim() to check the dimensions of the newly created object nga.mst.

class(nga.mst)
[1] "mst"    "matrix"
dim(nga.mst)
[1] 760   3

Note that the dimension is 760 and not 761. This is because a minimum spanning tree consists of n-1 edges (links) needed to traverse all n nodes.
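This property can be verified directly (a minimal sketch): the number of MST edges should equal the number of regions minus one.

```r
# An MST over n nodes always has n - 1 edges
nrow(nga.mst) == length(nga.nb) - 1  # TRUE
```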

We can display the content of nga.mst by using head() as shown in the code chunk below.

head(nga.mst)
     [,1] [,2]       [,3]
[1,]  358  515 0.22393951
[2,]  358  359 0.23253881
[3,]  359  332 0.14706995
[4,]  332  101 0.15328068
[5,]  101   65 0.04027682
[6,]  101  362 0.18619412

Similar to the previous section, we will plot the LGA boundaries first before plotting the MST. We can see how the initial neighbor list has been simplified to a tree in which a single set of edges connects all the nodes, passing through each of them.

plot(nga_wp_sp,border=grey(.5))
plot.mst(nga.mst,coordinates(nga_wp_sp),col="blue",cex.lab=0.2,cex.circles=0.06,add=TRUE)

9.4 Computing spatially constrained clusters using SKATER method

The below code chunks compute the spatially constrained cluster using skater() function of spdep package.

clust4<-skater(edges=nga.mst[,1:2],data=nga_att,method="euclidean",ncuts=3)

Note that there are three mandatory arguments of this skater() function:

  • edges: The first two columns of the MST matrix, i.e. the 2 nodes, not the cost.

  • data: The data frame with data observed over nodes (to update the costs as units are being grouped).

  • ncuts: The number of cuts, set to the number of clusters minus 1 (here 3 cuts, giving 4 clusters).

The result of skater() is an object of class skater. We can examine its contents by using the below code chunk.

str(clust4)
List of 8
 $ groups      : num [1:761] 1 1 1 1 1 3 3 1 1 1 ...
 $ edges.groups:List of 4
  ..$ :List of 3
  .. ..$ node: num [1:300] 194 359 559 334 605 715 362 566 573 101 ...
  .. ..$ edge: num [1:299, 1:3] 359 359 715 362 538 332 101 605 201 566 ...
  .. ..$ ssw : num 121
  ..$ :List of 3
  .. ..$ node: num [1:311] 237 411 123 667 636 480 479 760 112 651 ...
  .. ..$ edge: num [1:310, 1:3] 406 635 86 58 253 482 130 439 718 112 ...
  .. ..$ ssw : num 109
  ..$ :List of 3
  .. ..$ node: num [1:130] 496 627 298 172 631 118 544 681 537 70 ...
  .. ..$ edge: num [1:129, 1:3] 294 295 70 176 270 537 289 451 28 542 ...
  .. ..$ ssw : num 49.2
  ..$ :List of 3
  .. ..$ node: num [1:20] 326 300 320 610 503 21 48 453 680 61 ...
  .. ..$ edge: num [1:19, 1:3] 300 326 680 300 326 320 21 453 503 50 ...
  .. ..$ ssw : num 4.22
 $ not.prune   : NULL
 $ candidates  : int [1:4] 1 2 3 4
 $ ssto        : num 350
 $ ssw         : num [1:4] 350 316 295 283
 $ crit        : num [1:2] 1 Inf
 $ vec.crit    : num [1:761] 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "class")= chr "skater"

The above list structure shows the groups vector containing the label of the cluster to which each LGA belongs. Based on the order shown above, the first cluster has 300 LGAs, the second 311, the third 130 and the last 20, which sum to the 761 LGAs under study. The sum-of-squares measures are given as ssto for the total and ssw to show the effect of each cut on the overall criterion.
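The cluster sizes can also be read off the skater object directly. As a sketch, each element of edges.groups stores the nodes of one cluster, so their lengths give the sizes reported in the structure above:

```r
# Number of LGAs in each cluster, taken from the pruned tree
sapply(clust4$edges.groups, function(g) length(g$node))
# 300 311 130  20
```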

We can check the cluster assignment by using the below code chunks.

ccs4 <- clust4$groups
ccs4
  [1] 1 1 1 1 1 3 3 1 1 1 1 3 1 3 3 3 1 1 3 1 4 1 2 1 1 1 3 3 3 1 4 2 1 3 2 1 3
 [38] 3 1 3 1 1 3 3 1 2 2 4 2 4 1 1 1 1 1 1 1 2 1 1 4 2 2 2 1 3 1 2 3 3 3 2 2 2
 [75] 1 1 1 1 1 2 3 2 2 2 2 2 2 2 2 1 2 1 1 2 2 2 2 2 2 1 1 1 2 3 2 2 2 2 2 2 2
[112] 2 2 2 2 1 1 3 1 1 2 3 2 1 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2 2 2 2 2 1 1 2 2 2 1 2 2 1 2 2 2 1 1 3 3 3 3 3 3 4 3 3 1 1 3 1 1 3 3 3
[186] 3 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 3 1 1 1 1 2 2 2 2 2 2
[223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1 2 2 1 2 2 2 2 1 2 2
[260] 2 2 1 1 2 2 2 2 2 3 3 3 3 3 1 3 3 3 1 1 1 2 1 1 3 1 1 1 1 3 3 4 3 3 3 3 3
[297] 3 3 3 4 2 1 1 1 1 1 1 1 1 3 3 3 3 3 1 1 1 2 1 4 3 3 3 1 1 4 1 1 1 1 1 1 3
[334] 1 3 3 3 3 2 3 3 3 3 2 1 3 1 3 3 3 3 3 2 3 3 1 1 1 1 1 3 1 3 1 1 1 2 3 1 1
[371] 2 3 1 2 2 2 2 2 2 2 2 2 2 1 1 1 3 2 2 2 2 2 2 2 2 2 3 2 2 2 1 2 1 2 2 2 2
[408] 2 1 2 2 2 2 2 2 2 2 2 1 2 1 1 2 2 2 2 2 1 2 1 1 2 1 2 4 2 2 1 2 2 2 2 2 2
[445] 1 2 1 2 2 1 3 4 4 2 1 1 3 2 3 2 2 3 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2
[482] 2 2 2 2 2 1 1 2 2 1 2 2 1 3 3 2 3 3 2 2 2 4 2 2 2 2 1 1 1 1 1 2 2 1 2 2 1
[519] 1 1 1 1 1 1 1 1 1 1 2 1 3 1 1 1 1 1 3 1 1 1 1 3 1 3 1 1 3 1 1 1 1 1 3 3 3
[556] 1 1 1 1 1 1 1 1 1 1 1 4 1 3 3 3 1 1 1 1 1 1 3 1 3 3 3 1 1 3 3 3 1 1 1 1 1
[593] 1 3 1 3 3 1 3 1 1 1 1 1 1 1 1 1 1 4 1 3 1 1 1 1 1 1 1 1 3 3 1 1 3 3 3 2 1
[630] 1 3 1 2 1 2 2 2 3 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2 4 2 2 2 2 2 1 2 2
[667] 2 2 4 2 2 2 2 2 1 2 1 2 2 4 3 1 2 1 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1 1
[704] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 2 1 2 2 1 1 1 2 1 2 2 2 2
[741] 2 3 3 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2

We can also find out how many LGAs are in each cluster by means of the table() command, as shown in the below code chunk.

table(ccs4)
ccs4
  1   2   3   4 
300 311 130  20 

Notice that the number of LGAs in each cluster matches the observation stated above.

Lastly, we will plot the pruned tree that shows the four clusters on top of the boundaries map.

plot(nga_wp_sp,border=gray(.5))
plot(clust4,coordinates(nga_wp_sp),cex.lab=.2,groups.colors=c("red","green","blue","hotpink"),cex.circles=0.06,add=TRUE)

9.5 Visualizing the clusters in choropleth map

The code chunks below are used to plot the newly derived clusters by using SKATER method.

groups_mat<-as.matrix(clust4$groups)
nga_sf_spatialcluster<-cbind(nga_wp_cluster,as.factor(groups_mat))%>%
  rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(nga_sf_spatialcluster,"SP_CLUSTER")

For easy comparison, it is better to place the hierarchical clustering map and the spatially constrained clustering map derived with the SKATER method next to each other. In the below code chunks, we plot an interactive map. Note that tmap_mode() needs to be set to the “view” option.

tmap_mode("view")
hclust.map<-qtm(nga_wp_cluster,"CLUSTER")+
  tm_borders(alpha=0.5)+
  tm_view(set.zoom.limits = c(5,10))
shclust.map<-qtm(nga_sf_spatialcluster,"SP_CLUSTER")+
  tm_borders(alpha=0.5)+
  tm_view(set.zoom.limits = c(5,10))
tmap_arrange(hclust.map,shclust.map,asp=NA,ncol=2,sync=TRUE)

We then need to set tmap_mode() back to the “plot” option.

tmap_mode("plot")

We notice that, compared to the non-spatially constrained clustering method, LGAs that have similar attributes and are spatially autocorrelated are now grouped together and no longer fragmented under the spatially constrained clustering method. Spatially constrained clustering considers homogeneity in terms of both attributes and spatial autocorrelation, while non-spatially constrained clustering considers homogeneity in terms of attributes only.

9.6 Visualizing individual clustering variable

9.6.1 Multivariate Visualization

In the code chunks below, we use the ggparcoord() function of GGally package to plot a parallel coordinate plot revealing the clustering variables by cluster. Do note that the data used is nga_wp_cluster, which contains the results of the non-spatially constrained hierarchical clustering performed earlier.

ggparcoord(data = nga_wp_cluster, 
           columns = c(15,20,21), 
           scale = "globalminmax",
           alphaLines = 0.1,
           boxplot = TRUE, 
           title = "Multiple Parallel Coordinates Plots of Pct of Water points Variables by Non-Spatially Constrained Cluster") +
  facet_grid(~ CLUSTER) + 
  theme(axis.text.x = element_text(angle = 90))

From the above plot, we observe the following points:

  • Cluster 3 had the lowest percentage of water points with hand pumps while having very high percentage of rural water points.

  • Cluster 4 tends to have the highest percentage of rural water points (except for some outliers hovering at around 0.5 mark) and highest percentage of water points with hand pumps while having the lowest percentage of non-functional water points.

  • Cluster 1’s observations tend to have the lowest percentage of rural water points while having relatively low median percentage of non-functional water points and water points with hand pumps. However, we also notice that the range of percentage of non-functional water points and percentage of water points with hand pumps are very wide, ranging from 0 to 1.

We use similar code chunks to plot a parallel coordinate plot revealing the clustering variables by the spatially constrained clusters derived using the SKATER method. Do note that the data used is nga_sf_spatialcluster, with SP_CLUSTER as the facet_grid argument.

ggparcoord(data = nga_sf_spatialcluster, 
           columns = c(15,20,21), 
           scale = "globalminmax",
           alphaLines = 0.1,
           boxplot = TRUE, 
           title = "Multiple Parallel Coordinates Plots of Pct of Water points Variables by Spatially Constrained Cluster") +
  facet_grid(~ SP_CLUSTER) + 
  theme(axis.text.x = element_text(angle = 90))

Below are some of the observations:

  • Notice that due to the spatial constraints fitted into the model, the ranges of attributes have become wider as the model needs to compromise some homogeneity in attribute space.

  • We also notice more outliers across all four clusters, compared to outliers found in 2 out of four clusters in the plot of non-spatially constrained hierarchical clustering method.

  • Cluster 4 tends to have lowest percentages across all three attributes: percentage of non-functional water points, percentage of rural water points, percentage of water points with hand pumps.

  • Cluster 1 tends to have high percentage of rural water point and water points with hand pumps while having moderately low percentage of non-functional water points.

10 Spatially Constrained Clustering: ClustGeo Method

In this section, we will also perform spatially constrained hierarchical cluster analysis using ClustGeo Method.

10.1 Ward-like hierarchical clustering: ClustGeo

The ClustGeo package provides a function called hclustgeo() to perform a typical Ward-like hierarchical clustering, just like the hclust() used in the previous section.

In the below code chunk, we provide the hclustgeo() function with the dissimilarity matrix in attribute space, proxmat (derived previously), in order to perform non-spatially constrained hierarchical clustering. Note that the dissimilarity matrix must be an object of class dist, i.e. an object obtained with the function dist().

nongeo_cluster <- hclustgeo(proxmat)
plot(nongeo_cluster, cex = 0.1
    ,hang=-1)
rect.hclust(nongeo_cluster, 
            k = 4, 
            border = 2:5)

10.2 Mapping the clusters formed

We can plot the clusters on a categorical area shaded map by using the below code chunks:

groups <- as.factor(cutree(nongeo_cluster, k=4))
nga_sf_ngeo_cluster <- cbind(nga_wp, as.matrix(groups)) %>%
  rename(`CLUSTER` = `as.matrix.groups.`)
qtm(nga_sf_ngeo_cluster,"CLUSTER")

The code chunks above perform 3 operations to form the join:

  • as.matrix() function converts the groups factor into a one-column matrix; and

  • cbind() function of base R appends the groups matrix after conversion onto the nga_wp object by columns to produce a simple feature object called nga_sf_ngeo_cluster; and

  • rename() function of dplyr package renames as.matrix.groups. field to CLUSTER.

As this is a non-spatially constrained hierarchical clustering method, we notice that the plotted choropleth map still shows very fragmented clusters. We will then move on to perform spatially constrained hierarchical clustering.

10.3 Spatially Constrained Hierarchical Clustering

Before performing spatially constrained hierarchical clustering, we need to derive a spatial distance matrix by using the st_distance() function of sf package. st_distance() returns a dense numeric matrix of dimensions length(x) by length(y). as.dist() is then used to convert this matrix into a dist object.

dist <- st_distance(nga_wp, nga_wp)
distmat <- as.dist(dist)

Next, choicealpha() will be used to determine a suitable value for the mixing parameter alpha as shown in the code chunk below.

cr <- choicealpha(proxmat, distmat, range.alpha = seq(0, 1, 0.1), K=4, graph = TRUE)

With reference to the two graphs above, we pick alpha = 0.4. At 0.4, spatial contiguity increases to around 70% while homogeneity in attribute space is compromised by only around 20%.
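Depending on the installed version of ClustGeo, the object returned by choicealpha() also has a plot method with a norm argument; plotting the normalised criteria (Qnorm) can make the trade-off between the two spaces easier to read (a hedged sketch):

```r
# Plot the normalised proportions of explained (pseudo-)inertia,
# so both curves start at 1 when alpha favours their own space
plot(cr, norm = TRUE)
```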

clustG <- hclustgeo(proxmat, distmat, alpha = 0.4)

Next, cutree() function is used to derive the cluster object.

groups <- as.factor(cutree(clustG, k=4))

We will then join the group list back to the nga_wp polygon feature data frame by using the below code chunk. Similar to the previous section, we rename the as.matrix.groups. field to CLUSTER.

nga_sf_Gcluster <- cbind(nga_wp, as.matrix(groups)) %>%
  rename(`CLUSTER` = `as.matrix.groups.`)

We now can plot the map of the newly delineated spatially constrained clusters.

qtm(nga_sf_Gcluster,"CLUSTER")

We notice that the choropleth map is less fragmented, with clearly visible clusters.

For easy comparison, it will be better to place both the hierarchical clustering and spatially constrained hierarchical clustering maps by ClustGeo method next to each other. In the below code chunks, we will be plotting an interactive map. Note that tmap_mode() will need to be set to “view” option.

tmap_mode("view")
hclustgeo.map<-qtm(nga_sf_ngeo_cluster,"CLUSTER")+
  tm_borders(alpha=0.5)+
  tm_view(set.zoom.limits = c(5,10))
shclustgeo.map<-qtm(nga_sf_Gcluster,"CLUSTER")+
  tm_borders(alpha=0.5)+
  tm_view(set.zoom.limits = c(5,10))
tmap_arrange(hclustgeo.map,shclustgeo.map,asp=NA,ncol=2,sync=TRUE)

We then need to set tmap_mode() back to the “plot” option.

tmap_mode("plot")

We notice that, compared to the non-spatially constrained clustering under the ClustGeo method, the spatially constrained clustering method provides more homogeneity in terms of contiguity space. The choropleth map is more spatially homogeneous, with only some pockets of fragments, as the alpha value derived above balances homogeneity between attribute space and spatial space.

10.4 Visual Interpretation of Clusters

10.4.1 Multivariate Visualization

In the code chunks below, we use the ggparcoord() function of GGally package to plot a parallel coordinate plot revealing the clustering variables by cluster. Do note that the data used is nga_sf_ngeo_cluster, which contains the results of the non-spatially constrained hierarchical clustering using the ClustGeo approach.

ggparcoord(data = nga_sf_ngeo_cluster, 
           columns = c(15,20,21), 
           scale = "globalminmax",
           alphaLines = 0.1,
           boxplot = TRUE, 
           title = "Multiple Parallel Coordinates Plots of Pct of Water points Variables by Non-Spatially Constrained Cluster") +
  facet_grid(~ CLUSTER) + 
  theme(axis.text.x = element_text(angle = 90))

From the above plot, we observe the following points:

  • Cluster 3 had the lowest percentage of water points with hand pumps while having very high percentage of rural water points.

  • Cluster 4 tends to have the highest percentage of rural water points (except for some outliers hovering at around 0.5 mark) and highest percentage of water points with hand pumps while having the lowest percentage of non-functional water points.

  • Cluster 1’s observations tend to have the lowest percentage of rural water points while having relatively low median percentages of non-functional water points and water points with hand pumps. However, we also notice that the ranges of the percentage of non-functional water points and the percentage of water points with hand pumps are very wide, from 0 to 1.

  • LGA under Cluster 2 tend to have high percentage of rural water points while having moderately high percentage of water points with hand pumps.

We also plot another parallel coordinate plot to reveal clustering variables by spatially constrained hierarchical clusters using ClustGeo approach.

ggparcoord(data = nga_sf_Gcluster, 
           columns = c(15,20,21), 
           scale = "globalminmax",
           alphaLines = 0.1,
           boxplot = TRUE, 
           title = "Multiple Parallel Coordinates Plots of Pct of Water points Variables by Spatially Constrained Cluster") +
  facet_grid(~ CLUSTER) + 
  theme(axis.text.x = element_text(angle = 90))

We notice that across all 4 clusters, the ranges of values for the three attributes under the spatially constrained hierarchical clustering method have become wider compared to those of the non-spatially constrained clusters.

  • Observations under Cluster 1 tend to have moderately low percentages of non-functional water points, rural water points and water points with hand pumps.

  • Cluster 3’s LGAs tend to have high percentages of rural water points and water points with hand pumps while having a moderately low percentage of non-functional water points (quite similar to the characteristics of Cluster 4 under the non-spatially constrained clustering above).

  • Cluster 2’s observations tend to have a higher percentage of rural water points while having a moderately low percentage of water points with hand pumps.

11 Limitations

As mentioned during the data wrangling process, regions without water points have been excluded from this exercise. Some may argue that these regions (except Bakassi, as it belongs to Cameroon) could be grouped into a cluster of their own in order to complete the map of Nigeria.

12 Reference

Runfola, D. et al. (2020) geoBoundaries: A global database of political administrative boundaries. PLoS ONE 15(4): e0231866. https://doi.org/10.1371/journal.pone.0231866

Access data. WPdx. (n.d.). Retrieved November 30, 2022, from https://www.waterpointdata.org/access-data/

Jordan, O. (2022) Geospatial Analytics for Social Good-Understanding Nigeria Water functional and non-functional water point rate. Retrieved from https://jordan-isss624-geospatial.netlify.app/posts/geo/geospatial_exercise/#data-wrangling